K-Nearest Neighbours: A Basic Algorithm, But Brilliant When Used Right
Authors
Ondřej Duba
Filip Molhanec
Published
March 10, 2025
Introduction
There are many machine learning models that are widely used to solve supervised learning problems—many of them are highly powerful but come at cost of being complex and difficult to understood. Then there’s K-Nearest Neigbours (KNN), often regarded as one of the simplest machine learning algorithms. Despite its simplicity, KNN is quite powerful when applied in the right scenarios. As one of the core algorithms for supervised learning and its utility and power are undeniable
What is KNN?
As a quick reminder, supervised learning is process of creating model that can predict value of target variable based on input data, using knowledge from dataset where we know the actual values of the target variables. KNN can be effectively used for both classification (target variable can take a limited number of values) and regression (target variable can take on continuous range of values) tasks. The simplest explanation of KNN for classification tasks is that an object is classifed by plurality vote of its k nearest neighbours. For regression tasks, KNN can be generalized so that an object is assigned value that is the average of the values of its k nearest neigbours. However, this is just a basic overview and there are several factors to consider when using KNN to develop an effective machine learning model.
A simple visualisation of K-Nearest Neighbours (KNN) can be seen in image below (Figure 1). The yellow point represents the uknown data point that we are trying to classify. We make the prediction considering its closest k neighbours. For k = 3, the three closest points to the yellow point are two blue points and one red points. Since the majority of the neighbours are blue, would predict that the yellow point belongs to the blue class.
Figure 1: K-Nearest Neighbours Visualisation Ferreira et al. (2022)
Dataset description
To make the explanation clearer, we’ll demonstrane how KNN works by using specific dataset. Before diving into training and prediction let’s briefly review the dataset that we chose. The dataset that we selected is Air Quality and Pollution Assesment. This Dataset is derived from World Health Organization and World Bank Group This Dataset contains several features, in other words, columns, lets go through each one of them and explain what they mean.
Temperature(°C): Average temperature of the region
Humidity (%): Relative humidity recorded in the region
PM2.5 Concentration (µg/m³): Fine particulate matter level
Proximity to Industrial Areas (km): Distance to the nearest industrial zone
Population Density (people/km²): Number of people per square kilometer in the region
Then there is so called Target Variable, that’s the variable that we are trying to predict, in our dataset, it is called Air Quality and it can have 4 possible values depending on Air Quality, these values are the following:
Good: Clean air with low pollution levels.
Moderate: Acceptable air quality but with some pollutants present.
Poor: Noticeable pollution that may cause health issues for sensitive groups.
Hazardous: Highly polluted air posing serious health risks to the population.n.
How does KNN work?
We are trying to solve supervised learning problem, where based on p features \(X_{1}, ..., X_{p}\) (Temperature, Humidity, …) we want to predict the value of target variable \(Y\) (Air Quality). We can put all of these features into vector \(\textbf{X} = (X_{1}, ... X{_p})^T\) which we will interpret as a random vector, and one of its specific realization will be denoted as \(\mathbf{x}\). Let \(\mathcal{X}\) where \(\textbf{x} \in \mathcal{X}\) be set containing all possible values of these features, typically, and in our case, \(\mathcal{X} = \mathbb{R}^p\). Then we can can describe our training dataset as N pairs \((\textbf{x}_{1}, Y_{1}), ..., (\textbf{x}_{N}, Y_{N})\) where \(\textbf{x}_{i}\) is vector of features and \(Y_{i}\) is target variable. The basic concept of predicting value of data point \(\mathbf{x} \in \mathcal{X}\) is finding the k-closest points from it (these points come from the train dataset). If we are solving regression problems, we then take the average from the target varibles of the k-closest neighbours. In classification problems (that’s our case) we take the most frequent value from the target variables of the k-closest neighbours.
When trying to find the best model, there are many things to consider. First of all we should define distance function. Distance also called metric defined on set \(\mathcal{X}\) is function d: \(\mathcal{X} \times \mathcal{X} \rightarrow [0, +\infty)\) such that for all \(x, y, z \in \mathcal{X}\), the following conditions holds:
\(\text{i) } d(x,y) \geq 0\), and \(d(x,y) = 0\) if and only if \(x = y\) (positive definiteness) \(\text{ii) }d(x,y) = d(y,x)\) (symmetry) \(\text{iii) }d(x,y) \geq d(x,z) + d(z,y)\) (triangle inequality)
The most common distance function used is KNN is Minkowski distance also called q-norm or l-metric defined as follows: \(||x-y||_{q} = d_{q}(x,y) = \sqrt[q]{\sum_{i=1}^{p} (x_i - y_i)^q}\) Example of L1 and L2 distance Figure 2
After we have found k-nearest neighbours, we need to somehow decide what value we should predict. In our case, we could just precict the value with the most frequent target variable. What we can also do is try weighted voting, that means that closer neighbours will have vote with higher value.
Common choice for weight is inverse distance weighting \(w_{i} = \frac{1}{d(x,y)}\)
For each category \(c\) that target variable can be, we count the total weighted vote \(W_{c} = \sum_{Y_i \in \mathcal{N}} w_i \cdot 1(Y_i = c)\)
where \(\mathcal{N}\) is set of target variables of k-nearest neighbours. We then choose \(\hat{y} = \arg\max_c W_c\), where \(c\) is every possible category of target variable, as our prediction.
What we need to do first is to split the data into 3 smaller datasets. Train, validate and test with train size being 60% of the original dataset and validate and test both being 20% of the original dataset, this will play key role in the training. We will use the train dataset to train the model and then the validate dataset to find the best hyperparameters. Finally, the test dataset will will serve as an option to measure how good our model is.
Finding the Best model (Hyperparameter tuning)
When finding the best model, there are many things to consider. Our goal is to somehow maximize the classification accuracy on new data, that the model has never seen before. Accuracy is defined as \[\text{accuracy = }\frac{ \text{number of correctly classified}}{ \text{number of all classified}} \in (0,1)\] The accuracy which we are trying to maximize is the accuracy of how our model built using train dataset will predict on validating dataset. In order to maximize this accuracy, we can try changing the number of neighbours (k), then we can try changing the distance metric, lastly, we can try whether using weighted distance makes any difference. We build our model using combination of these different possibilites and for every one of them we meassure the accuracy on validating dataset. We then choose the model with the highest accuracy.
The results
Using no normalization, the best accuracy we get is 0.833 with hyperparameters k (number of neighbours) = 8, weights = distance (closer points to our data point have higher value than points further away) and finally metric = 1 (Manhattan metric)
Code
# Importing all libraries that will be usedimport numpy as npimport matplotlib.pyplot as pltimport pandas as pdfrom sklearn.model_selection import train_test_split, ParameterGridfrom sklearn import metricsimport numpy as npimport pandas as pdimport matplotlib.pyplot as pltfrom sklearn.neighbors import KNeighborsClassifierfrom sklearn.preprocessing import StandardScaler, MinMaxScalerimport plotly.express as pximport ipywidgets as widgetsfrom ipywidgets import interactiveimport seaborn as snsimport matplotlib.patheffects as path_effectsdef preprocess_data(df:pd.DataFrame)->pd.DataFrame:""" Function, for preprocessing data """ qual_category = pd.api.types.CategoricalDtype(categories=['Hazardous', 'Poor', 'Moderate', 'Good'], ordered=True) df['Air Quality'] = df['Air Quality'].astype(qual_category)return dfdef read_data(path:str='data/data.csv', y:str='Air Quality',**kwargs)->tuple:""" Function thats read data, and splits them into Train, Validation and Test datasets also separates, target value from others values. --- Attributes: path: [str], path to csv data file y: [str], name of Target value kwargs: options, use seed for random_seed --- Returns: tuple with Train,Validation,Test parametrs set, Target values: Train, Test, Validation """ df = pd.read_csv(path)#display(df.info())#display(df.describe()) df = preprocess_data(df)# Split the training dataset into train and rest (default 60% : 40%) Xtrain, Xrest, ytrain, yrest = train_test_split( df.drop(columns=[y]), df[y], test_size=0.4, random_state=kwargs.get('seed',42))# Split the rest of the data into validation dataset and test dataset (default: 24% : 16%) Xtest, Xval, ytest, yval = train_test_split( Xrest, yrest, test_size=0.5, random_state=kwargs.get('seed',42))#print(f'Dataset: {path} | Target value: {y} | Seed: {kwargs.get('seed',42)}')return Xtrain, Xtest, Xval, ytrain, ytest, yvaldef train_model(*args,**kwargs):"""Function, that trains model with specific paraketrs, defined in kwargs. --- Attributes: *args: Xtrain,Ytrain **kwargs: options for training model --- Return: trained model """ X = args[0] y = args[1]return KNeighborsClassifier(**kwargs).fit(X,y)Xtrain, Xtest, Xval, ytrain, ytest, yval = read_data(seed=42)def find_best(*args):""" Function finding best parametrs for specified data --- Attributes: Xtrain, ytrain, Xval, Yval """ param_grid = {'n_neighbors': range(1,15), 'p': range(1,4),'weights': ['uniform', 'distance'] } param_comb = ParameterGrid(param_grid) val_acc = [] train_acc = []for params in param_comb: clfKNN = KNeighborsClassifier(**params) clfKNN.fit(args[0], args[1]) train_acc.append(clfKNN.score(args[0],args[1])) val_acc.append(clfKNN.score(args[2],args[3]))return param_comb[np.argmax(val_acc)], train_acc, val_accbest_params_no_scale, train_acc, val_acc = find_best(Xtrain,ytrain,Xval,yval)scaler = StandardScaler()Xtrain_fit = scaler.fit_transform(Xtrain)Xval_fit = scaler.transform(Xval)best_params_standard, train_acc_standard, val_acc_standard = find_best(Xtrain_fit,ytrain,Xval_fit,yval)scaler = MinMaxScaler()Xtrain_fit = scaler.fit_transform(Xtrain)Xval_fit = scaler.transform(Xval)best_params_minmax, train_acc_minmax, val_acc_minmax = find_best(Xtrain_fit,ytrain,Xval_fit,yval)clfKNN = KNeighborsClassifier(**best_params_no_scale)clfKNN.fit(Xtrain, ytrain)y_knn_val = clfKNN.predict(Xval)#display(metrics.accuracy_score(yval, y_knn_val))fig, ax = plt.subplots(figsize=(15,5))ax.set_title("Model accuracy on Training and Validating datasets based on hyperparameter index (no normalization)", fontsize=17)ax.plot(train_acc,'or-')ax.plot(val_acc,'ob-')max_idx = np.argmax(val_acc)ax.plot(max_idx, val_acc[max_idx], marker='o', color='#77DD77', markersize=13, markeredgecolor='black', markeredgewidth=2)ax.set_ylim(bottom=0.67, top=1.02)ax.set_xlabel('Hyperparameter index', fontsize=13)ax.set_ylabel('Accuracy', fontsize=13)ax.legend(['Train', 'Validate', 'Best validate accuracy'], loc='lower right')print("Best Hyperparameter values: ")display(best_params_no_scale)
Best Hyperparameter values:
{'weights': 'distance', 'p': 1, 'n_neighbors': 8}
Normalization
One technique that can also help with making the model more accurate is normalization. In some cases, the features are difficult to compare and it is essentially impossible to find a universal measure. This can be addressed by using clever methods to normalize the data using linear transformation. The simplest method used to normalize data is called Min-Max normalization which scales every element in selected feature into interval \([0,1]\). It is defined as follows:
For given feature let’s call its minimal value \(\text{min}_x\) and the maximal value \(\text{max}_x\), then we will define the value \(x_i\) of this feature as \(x_i \leftarrow \frac{ x_i - \text{min}_x}{ \text{max}_i - \text{min}_x}\)
Another commonly used normalization method is standardization, defined as follows \(x_i \leftarrow \frac{x_i - \bar{x}}{\sqrt{s_x^2}}\) where \(\bar{x} = \frac{1}{n} \sum_{i} x_i\) is sample mean and \(s_x^2 = \frac{1}{n-1} \sum_{i}^{} {(x_i - \bar{x})^2}\) is sample variance
Standardization accuracy
Using Standardization, the best accuracy we get is 0.931
Code
scaler = StandardScaler()Xtrain_fit = scaler.fit_transform(Xtrain)Xval_fit = scaler.transform(Xval)clfKNN = KNeighborsClassifier(**best_params_standard)clfKNN.fit(Xtrain_fit, ytrain)y_knn_val = clfKNN.predict(Xval_fit)display(metrics.accuracy_score(yval, y_knn_val))fig, ax = plt.subplots(figsize=(15,5))ax.set_title("Model accuracy on Training and Validating datasets based on hyperparameter index (standardization)", fontsize=17)ax.plot(train_acc_standard,'or-')ax.plot(val_acc_standard,'ob-')max_idx = np.argmax(val_acc_standard)#display(val_acc_standard[max_idx])ax.plot(max_idx, val_acc_standard[max_idx], marker='o', color='#77DD77', markersize=13, markeredgecolor='black', markeredgewidth=2)ax.set_xlabel('Hyperparameter index', fontsize=13)ax.set_ylabel('Accuracy', fontsize=13)ax.legend(['Train', 'Validate', 'Best validate accuracy'])
0.931
Min-Max accuracy
Using Min-Max normalization, the best accuracy we get is 0.942
Code
scaler = MinMaxScaler()Xtrain_fit = scaler.fit_transform(Xtrain)Xval_fit = scaler.transform(Xval)clfKNN = KNeighborsClassifier(**best_params_minmax)clfKNN.fit(Xtrain_fit, ytrain)y_knn_val = clfKNN.predict(Xval_fit)display(metrics.accuracy_score(yval, y_knn_val))fig, ax = plt.subplots(figsize=(15,5))ax.set_title("Model accuracy on Training and Validating datasets based on hyperparameter index (min-max normalization)", fontsize=17)ax.plot(train_acc_minmax,'or-')ax.plot(val_acc_minmax,'ob-')max_idx = np.argmax(val_acc_minmax)#display(val_acc_minmax[max_idx])ax.plot(max_idx, val_acc_minmax[max_idx], marker='o', color='#77DD77', markersize=13, markeredgecolor='black', markeredgewidth=2)ax.set_xlabel('Hyperparameter index', fontsize=13)ax.set_ylabel('Accuracy', fontsize=13)ax.legend(['Train', 'Validate', 'Best validate accuracy'])
0.942
Choosing the final model
Based on the results, we can see that the best model is the model that uses Min-Max normalization, uses weighted distance, 3-norm and 3 neighbours with accuracy of 0.942 We can graph confusion matrix, that shows how accurately we predicted each type of category. Finally, we will predict the expected accuracy on the test dataset, which will give us an estimate of how accurately the model will perform on new, unseen data.
{'weights': 'distance', 'p': 3, 'n_neighbors': 3}
Code
max_idx = np.argmax(val_acc)max_idx_standard = np.argmax(val_acc_standard)max_idx_minmax = np.argmax(val_acc_minmax)labels = ['without_normalization', 'standardization', 'Min-Max']values = [val_acc[max_idx], val_acc_standard[max_idx_standard], val_acc_minmax[max_idx_minmax]] fig, ax = plt.subplots()colors = ['#3498db', '#2ecc71', '#e74c3c']ax.bar(labels, values, color=colors, edgecolor='black', width=0.5)ax.set_ylim(0, 1.15)ax.set_yticks([0, 0.2, 0.4, 0.6, 0.8, 1.0])ax.set_xlabel("Normalization type", fontsize=14)ax.set_ylabel("Accuracy", fontsize=14)ax.set_title("Accuracy of models based on normalization", fontsize=16)ax.grid(axis='y', linestyle='--', alpha=0.7)for i, value inenumerate(values): ax.text(i, value +0.02, f'{value:.2f}', ha='center', fontsize=12)plt.show()
Summary
Finaly we want to meassure how accurately will our model using MinMax normalization and KNN perform on new data (data that the model has never seen). For this we can use the test dataset, which our model never seen during the training. Our model has 0.921 accuracy on test dateset, this is the accuracy that we can expect on new data. Based on the accuracy and on the confusion matrix shown below we can see that the model performed quite well
Before training the model on the train part of the dataset, we can try to look at the values to learn a little bit more about the dataset. The reason we only explore the train part of the dataset is that we don’t look at values from validating and test datasets because we want to treat them as “new” data that the model has not seen so that they can be used for accuracy estimates.
Code
df_original = Xtraindf_tmp = ytrain.copy()df_tmp = df_tmp.astype("category")df_tmp_counts = df_tmp.value_counts()custom_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']fig = plt.figure(figsize=(15,8))ax1 = fig.add_subplot(1,2,1)sns.barplot(x = df_tmp_counts.index, y = df_tmp_counts.values, ax = ax1, order=df_tmp_counts.index)for i, bar inenumerate(ax1.patches): bar.set_facecolor(custom_colors[i])ax1.set_xlabel("Air quality type", fontsize =13)ax1.set_ylabel("Frequency",fontsize =13)ax1.set_title("Air quality type Frequency in Train dataset",fontsize =17)ax1.grid(axis='y', color='black', alpha=.2, linewidth=.5)animal_count = df_tmp.value_counts()ax2 = fig.add_subplot(1,2,2)ax2.pie(animal_count, labels=animal_count.index,autopct='%.0f%%', textprops={"fontsize": 13})ax2.set_title("Air quality type distribution in Train dataset", fontsize =17)
Text(0.5, 1.0, 'Air quality type distribution in Train dataset')